About the dataset¶

The data relate to direct marketing campaigns (phone calls) of a Portuguese banking institution. More than one contact with the same client was often required in order to determine whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The classification goal is therefore to predict whether the client will subscribe (yes/no) a term deposit (variable y).

In [1]:
pip install shap
Requirement already satisfied: shap in c:\users\l03534795\anaconda3\lib\site-packages (0.41.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
sns.set_theme(color_codes=True)


import warnings 
warnings.filterwarnings("ignore")
In [3]:
df = pd.read_csv("bank-additional-full.csv", delimiter=";")
In [4]:
pd.set_option("display.max_columns", None)
df.head()
Out[4]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 housemaid married basic.4y no no no telephone may mon 261 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
1 57 services married high.school unknown no no telephone may mon 149 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
2 37 services married high.school no yes no telephone may mon 226 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
3 40 admin. married basic.6y no no no telephone may mon 151 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
4 56 services married high.school no no yes telephone may mon 307 1 999 0 nonexistent 1.1 93.994 -36.4 4.857 5191.0 no
In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41188 entries, 0 to 41187
Data columns (total 21 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   age             41188 non-null  int64  
 1   job             41188 non-null  object 
 2   marital         41188 non-null  object 
 3   education       41188 non-null  object 
 4   default         41188 non-null  object 
 5   housing         41188 non-null  object 
 6   loan            41188 non-null  object 
 7   contact         41188 non-null  object 
 8   month           41188 non-null  object 
 9   day_of_week     41188 non-null  object 
 10  duration        41188 non-null  int64  
 11  campaign        41188 non-null  int64  
 12  pdays           41188 non-null  int64  
 13  previous        41188 non-null  int64  
 14  poutcome        41188 non-null  object 
 15  emp.var.rate    41188 non-null  float64
 16  cons.price.idx  41188 non-null  float64
 17  cons.conf.idx   41188 non-null  float64
 18  euribor3m       41188 non-null  float64
 19  nr.employed     41188 non-null  float64
 20  y               41188 non-null  object 
dtypes: float64(5), int64(5), object(11)
memory usage: 6.6+ MB
In [6]:
df.isnull().sum()
Out[6]:
age               0
job               0
marital           0
education         0
default           0
housing           0
loan              0
contact           0
month             0
day_of_week       0
duration          0
campaign          0
pdays             0
previous          0
poutcome          0
emp.var.rate      0
cons.price.idx    0
cons.conf.idx     0
euribor3m         0
nr.employed       0
y                 0
dtype: int64

Exploratory data analysis¶

In [7]:
# Select the categorical columns
df_categoricos = df[["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", 
                    "poutcome", "y"]]
df_categoricos.head()
Out[7]:
job marital education default housing loan contact month day_of_week poutcome y
0 housemaid married basic.4y no no no telephone may mon nonexistent no
1 services married high.school unknown no no telephone may mon nonexistent no
2 services married high.school no yes no telephone may mon nonexistent no
3 admin. married basic.6y no no no telephone may mon nonexistent no
4 services married high.school no no yes telephone may mon nonexistent no
In [8]:
# Select the numeric columns
df_numericos = df[["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]]
df_numericos.head()
Out[8]:
age duration campaign pdays previous emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed
0 56 261 1 999 0 1.1 93.994 -36.4 4.857 5191.0
1 57 149 1 999 0 1.1 93.994 -36.4 4.857 5191.0
2 37 226 1 999 0 1.1 93.994 -36.4 4.857 5191.0
3 40 151 1 999 0 1.1 93.994 -36.4 4.857 5191.0
4 56 307 1 999 0 1.1 93.994 -36.4 4.857 5191.0
In [9]:
# sns.countplot docs: https://seaborn.pydata.org/generated/seaborn.countplot.html

cat_vars = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome"]

# Create a figure with subplots
fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

# Draw a countplot for each categorical variable
for i, var in enumerate(cat_vars):
    sns.countplot(x=var, hue="y", data=df_categoricos, ax=axs[i])
    axs[i].set_xticklabels(axs[i].get_xticklabels(), rotation=90)

# Adjust the spacing between subplots
fig.tight_layout()

# Show the plot
plt.show()
In [10]:
# seaborn.histplot docs: https://seaborn.pydata.org/generated/seaborn.histplot.html

# List of categorical variables
cat_vars = ["job", "marital", "education", "default", "housing", "loan", "contact", "month", "day_of_week", "poutcome"]

# Create a figure with subplots
fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

# Draw a normalized (stacked-to-1) histogram for each categorical variable
for i, var in enumerate(cat_vars):
    sns.histplot(x=var, hue="y", data=df_categoricos, ax=axs[i], multiple="fill", kde=False, element="bars", fill=True, stat="density")
    axs[i].set_xticklabels(df_categoricos[var].unique(), rotation=90)
    axs[i].set_xlabel(var)

# Adjust the spacing between subplots
fig.tight_layout()

# Show the plot
plt.show()
  1. Most of the people who subscribe a term deposit are retirees or students.

  2. Most of the people who subscribe a term deposit are contacted via cellphone.

  3. Most of the people who subscribe a term deposit had their last contact in October, December, March, or September.

  4. Most of the people who subscribe a term deposit come from a previous marketing campaign whose outcome was a success.

Highlighting some aspects of the data distribution¶

In [11]:
# sns.boxplot docs: https://seaborn.pydata.org/generated/seaborn.boxplot.html

num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.boxplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [12]:
# sns.violinplot docs: https://seaborn.pydata.org/generated/seaborn.violinplot.html

num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.violinplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [13]:
num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.violinplot(x=var, y="y", data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [14]:
num_vars = ["age", "duration", "campaign", "pdays", "previous", "emp.var.rate", "cons.price.idx", 
                  "cons.conf.idx", "euribor3m", "nr.employed"]

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars): 
    sns.histplot(x=var, data=df, ax=axs[i])
    
fig.tight_layout()
    
plt.show()
In [15]:
num_vars = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
            'cons.conf.idx', 'euribor3m', 'nr.employed']

fig, axs = plt.subplots(nrows=2, ncols=5, figsize=(20, 10))
axs = axs.flatten()

for i, var in enumerate(num_vars):
    sns.histplot(x=var, hue='y', data=df, ax=axs[i], multiple="stack")

fig.tight_layout()

plt.show()
In [16]:
# sns.pairplot docs: https://seaborn.pydata.org/generated/seaborn.pairplot.html

# List of numeric variables
num_vars = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 
            'cons.conf.idx', 'euribor3m', 'nr.employed']

# Build a scatter-plot matrix
sns.pairplot(df, hue='y')
Out[16]:
<seaborn.axisgrid.PairGrid at 0x1b178219400>

Data processing¶

Using pandas.Series.unique¶

Returns the unique values of a Series object.

Unique values are returned in order of appearance. They are computed with a hash table, so they are NOT sorted.

https://pandas.pydata.org/docs/reference/api/pandas.Series.unique.html
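A minimal, self-contained illustration of the order-of-appearance behavior described above (toy values, not from the bank dataset):

```python
import pandas as pd

# unique() returns values in order of first appearance, not sorted
s = pd.Series(["b", "a", "b", "c", "a"])
print(s.unique())  # → ['b' 'a' 'c']
```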

In [17]:
df['job'].unique()
Out[17]:
array(['housemaid', 'services', 'admin.', 'blue-collar', 'technician',
       'retired', 'management', 'unemployed', 'self-employed', 'unknown',
       'entrepreneur', 'student'], dtype=object)
In [18]:
df['marital'].unique()
Out[18]:
array(['married', 'single', 'divorced', 'unknown'], dtype=object)
In [19]:
df['education'].unique()
Out[19]:
array(['basic.4y', 'high.school', 'basic.6y', 'basic.9y',
       'professional.course', 'unknown', 'university.degree',
       'illiterate'], dtype=object)
In [20]:
df['default'].unique()
Out[20]:
array(['no', 'unknown', 'yes'], dtype=object)
In [21]:
df['housing'].unique()
Out[21]:
array(['no', 'yes', 'unknown'], dtype=object)
In [22]:
df['loan'].unique()
Out[22]:
array(['no', 'yes', 'unknown'], dtype=object)
In [23]:
df['contact'].unique()
Out[23]:
array(['telephone', 'cellular'], dtype=object)
In [24]:
df['month'].unique()
Out[24]:
array(['may', 'jun', 'jul', 'aug', 'oct', 'nov', 'dec', 'mar', 'apr',
       'sep'], dtype=object)
In [25]:
df['day_of_week'].unique()
Out[25]:
array(['mon', 'tue', 'wed', 'thu', 'fri'], dtype=object)
In [26]:
df['poutcome'].unique()
Out[26]:
array(['nonexistent', 'failure', 'success'], dtype=object)
In [27]:
df['y'].unique()
Out[27]:
array(['no', 'yes'], dtype=object)

Transforming the data¶

The sklearn.preprocessing package provides several common utility functions and transformer classes to change feature vectors into a representation better suited to the downstream estimators.¶

In general, learning algorithms benefit from standardization of the dataset.¶

https://scikit-learn.org/stable/modules/preprocessing.html
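The per-column LabelEncoder cells below can be collapsed into a single loop. A hedged sketch with a toy frame standing in for the bank data (column names match the dataset, but the values are illustrative only):

```python
import pandas as pd
from sklearn import preprocessing

# Toy stand-in for the bank dataframe (hypothetical values)
toy = pd.DataFrame({"contact": ["telephone", "cellular", "telephone"],
                    "y": ["no", "yes", "no"]})

# LabelEncoder assigns integers to the sorted unique values of each column
for col in ["contact", "y"]:
    toy[col] = preprocessing.LabelEncoder().fit_transform(toy[col])

print(toy["contact"].tolist())  # → [1, 0, 1]  ('cellular'=0, 'telephone'=1)
print(toy["y"].tolist())        # → [0, 1, 0]  ('no'=0, 'yes'=1)
```

One caveat: LabelEncoder imposes an arbitrary ordinal ordering; for purely nominal features, one-hot encoding is often the safer choice.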

In [28]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['job']= label_encoder.fit_transform(df['job'])
df['job'].unique()
Out[28]:
array([ 3,  7,  0,  1,  9,  5,  4, 10,  6, 11,  2,  8])
In [29]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['marital']= label_encoder.fit_transform(df['marital'])
df['marital'].unique()
Out[29]:
array([1, 2, 0, 3])
In [30]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['education']= label_encoder.fit_transform(df['education'])
df['education'].unique()
Out[30]:
array([0, 3, 1, 2, 5, 7, 6, 4])
In [31]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['default']= label_encoder.fit_transform(df['default'])
df['default'].unique()
Out[31]:
array([0, 1, 2])
In [32]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['housing']= label_encoder.fit_transform(df['housing'])
df['housing'].unique()
Out[32]:
array([0, 2, 1])
In [33]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['loan']= label_encoder.fit_transform(df['loan'])
df['loan'].unique()
Out[33]:
array([0, 2, 1])
In [34]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['contact']= label_encoder.fit_transform(df['contact'])
df['contact'].unique()
Out[34]:
array([1, 0])
In [35]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['month']= label_encoder.fit_transform(df['month'])
df['month'].unique()
Out[35]:
array([6, 4, 3, 1, 8, 7, 2, 5, 0, 9])
In [36]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['day_of_week']= label_encoder.fit_transform(df['day_of_week'])
df['day_of_week'].unique()
Out[36]:
array([1, 3, 4, 2, 0])
In [37]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['poutcome']= label_encoder.fit_transform(df['poutcome'])
df['poutcome'].unique()
Out[37]:
array([1, 0, 2])
In [38]:
from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()
df['y']= label_encoder.fit_transform(df['y'])
df['y'].unique()
Out[38]:
array([0, 1])
In [39]:
df.head()
Out[39]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
0 56 3 1 0 0 0 0 1 6 1 261 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
1 57 7 1 3 1 0 0 1 6 1 149 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
2 37 7 1 3 0 2 0 1 6 1 226 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
3 40 0 1 1 0 0 0 1 6 1 151 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0
4 56 7 1 3 0 0 2 1 6 1 307 1 999 0 1 1.1 93.994 -36.4 4.857 5191.0 0

Balancing the labels:¶

The "y" label

Insert a bar plot to show the observation counts in each category using bars.¶

In [40]:
sns.countplot(x='y', data=df)
df['y'].value_counts()
Out[40]:
0    36548
1     4640
Name: y, dtype: int64

Use the resample function: https://scikit-learn.org/stable/modules/generated/sklearn.utils.resample.html¶

In [41]:
from sklearn.utils import resample

# Create two dataframes: one for the majority class, one for the minority class
df_majority = df[(df['y']==0)] 
df_minority = df[(df['y']==1)] 

# Upsample the minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement 
                                 n_samples=36548,  # to match the majority class
                                 random_state=0)   # reproducible results

# Combine the majority class with the upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])
In [42]:
sns.countplot(x='y', data=df_upsampled)
df_upsampled['y'].value_counts()
Out[42]:
1    36548
0    36548
Name: y, dtype: int64

Removing outliers using the IQR¶

Detecting outliers is tedious, especially when multiple data types are involved.

We therefore use different outlier-detection methods for different kinds of data.

For normally distributed data, we can use the Z-score method;

for skewed data, the IQR is used.

The IQR is the difference between the 75th and 25th percentiles.¶
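For the normally distributed case mentioned above, a Z-score filter could look like the sketch below (illustrative only, with an assumed ±3 standard-deviation cutoff; the notebook itself applies only the IQR method):

```python
import numpy as np
import pandas as pd

def remove_outliers_zscore(df, columns, threshold=3.0):
    # Keep only rows whose |z-score| stays below the threshold in every column
    for col in columns:
        z = (df[col] - df[col].mean()) / df[col].std()
        df = df[z.abs() < threshold]
    return df

# Toy example: 1000 standard-normal draws plus one extreme value
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": np.append(rng.normal(0, 1, 1000), 50.0)})
clean = remove_outliers_zscore(data, ["x"])
print(len(data), len(clean))  # the extreme value is filtered out
```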

In [43]:
def remove_outliers_iqr(df, columns):
    for col in columns:
        q1 = df[col].quantile(0.25)
        q3 = df[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]
    return df

# Columns to check for outliers
columns_to_check = ['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 
                    'euribor3m', 'nr.employed']

# Apply the IQR outlier-removal function
df_clean = remove_outliers_iqr(df_upsampled, columns_to_check)

# Show the resulting dataframe
df_clean.head()
Out[43]:
age job marital education default housing loan contact month day_of_week duration campaign pdays previous poutcome emp.var.rate cons.price.idx cons.conf.idx euribor3m nr.employed y
37017 25 8 2 7 1 2 0 0 3 3 371 1 999 0 1 -2.9 92.469 -33.6 1.044 5076.2 1
36682 51 9 2 6 0 0 0 0 4 0 657 1 999 0 1 -2.9 92.963 -40.8 1.268 5076.2 1
29384 45 7 2 7 0 0 0 1 0 0 541 1 999 0 1 -1.8 93.075 -47.1 1.405 5099.1 1
21998 29 9 2 3 1 0 0 0 1 4 921 3 999 0 1 1.4 93.444 -36.1 4.964 5228.1 1
16451 37 10 2 2 1 2 2 0 3 4 633 1 999 0 1 1.4 93.918 -42.7 4.963 5228.1 1
In [44]:
df_clean.shape
Out[44]:
(49702, 21)

Correlation shown as a heatmap¶

Seaborn is a Python library that makes it easy to produce better plots thanks to its heatmap() function. A heatmap is a graphical representation of data in which each value of a matrix is rendered as a color.

https://seaborn.pydata.org/generated/seaborn.heatmap.html

In [45]:
plt.figure(figsize=(20, 16))
sns.heatmap(df_clean.corr(), fmt='.2g', annot=True)
Out[45]:
<AxesSubplot:>

Defining the feature vector (X) and the target variable (y)¶

In [46]:
X = df_clean.drop('y', axis=1)
y = df_clean['y']

Split arrays or matrices into random train and test subsets.¶

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

To be precise, the split() method generates the train and test indices, not the data themselves.

Having multiple splits can be useful if you want a more reliable estimate of your model's performance.
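The "multiple splits" idea can be sketched with cross_val_score (an illustrative example on synthetic data; the notebook itself uses a single 70/30 split):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and y (hypothetical data)
X_demo, y_demo = make_classification(n_samples=500, random_state=0)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X_demo, y_demo, cv=5)  # five different train/test splits
print(scores.mean(), scores.std())
```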

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3,random_state=0)

Instantiate the model (decision tree)¶

  1. We instantiate the model with the Gini index criterion
In [48]:
from sklearn.tree import DecisionTreeClassifier
clf_gini = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)

clf_gini.fit(X_train, y_train)
Out[48]:
DecisionTreeClassifier(max_depth=3, random_state=0)
In [49]:
# Make predictions with the Gini classifier
y_pred_gini = clf_gini.predict(X_test)
In [50]:
# Evaluate accuracy
print('Accuracy on the training set: {:.2f}'
     .format(clf_gini.score(X_train, y_train)))
print('Accuracy on the test set: {:.2f}'
     .format(clf_gini.score(X_test, y_test)))
Accuracy on the training set: 0.85
Accuracy on the test set: 0.85

Performance metrics¶

Accuracy score with the Gini index criterion

In [51]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, jaccard_score
print('F-1 Score : ',(f1_score(y_test, y_pred_gini, average='micro')))
print('Precision Score : ',(precision_score(y_test, y_pred_gini, average='micro')))
print('Recall Score : ',(recall_score(y_test, y_pred_gini, average='micro')))
print('Jaccard Score : ',(jaccard_score(y_test, y_pred_gini, average='micro')))
F-1 Score :  0.8468915565689761
Precision Score :  0.846891556568976
Recall Score :  0.846891556568976
Jaccard Score :  0.7344422472955682
In [52]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
print (classification_report(y_test, y_pred_gini))
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      8875
           1       0.81      0.81      0.81      6036

    accuracy                           0.85     14911
   macro avg       0.84      0.84      0.84     14911
weighted avg       0.85      0.85      0.85     14911

In [53]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_gini)
# confusion_matrix puts actual classes on the rows and predicted classes on the columns
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'], 
                                  columns=['Predicted Negative:0', 'Predicted Positive:1'])

plt.figure(figsize=(9,9))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Blues')

Tree visualization¶

In [54]:
plt.figure(figsize=(12,8))

from sklearn import tree

tree.plot_tree(clf_gini.fit(X_train, y_train))
Out[54]:
[Text(0.5, 0.875, 'X[10] <= 362.5\ngini = 0.48\nsamples = 34791\nvalue = [20859, 13932]'),
 Text(0.25, 0.625, 'X[18] <= 3.168\ngini = 0.34\nsamples = 22575\nvalue = [17663, 4912]'),
 Text(0.125, 0.375, 'X[19] <= 5087.65\ngini = 0.495\nsamples = 8226\nvalue = [3702, 4524]'),
 Text(0.0625, 0.125, 'gini = 0.352\nsamples = 3618\nvalue = [825, 2793]'),
 Text(0.1875, 0.125, 'gini = 0.469\nsamples = 4608\nvalue = [2877, 1731]'),
 Text(0.375, 0.375, 'X[8] <= 7.5\ngini = 0.053\nsamples = 14349\nvalue = [13961, 388]'),
 Text(0.3125, 0.125, 'gini = 0.03\nsamples = 14167\nvalue = [13949, 218]'),
 Text(0.4375, 0.125, 'gini = 0.123\nsamples = 182\nvalue = [12, 170]'),
 Text(0.75, 0.625, 'X[10] <= 525.5\ngini = 0.386\nsamples = 12216\nvalue = [3196, 9020]'),
 Text(0.625, 0.375, 'X[18] <= 2.916\ngini = 0.492\nsamples = 4384\nvalue = [1913, 2471]'),
 Text(0.5625, 0.125, 'gini = 0.309\nsamples = 2083\nvalue = [398, 1685]'),
 Text(0.6875, 0.125, 'gini = 0.45\nsamples = 2301\nvalue = [1515, 786]'),
 Text(0.875, 0.375, 'X[18] <= 1.402\ngini = 0.274\nsamples = 7832\nvalue = [1283, 6549]'),
 Text(0.8125, 0.125, 'gini = 0.136\nsamples = 2315\nvalue = [170, 2145]'),
 Text(0.9375, 0.125, 'gini = 0.322\nsamples = 5517\nvalue = [1113, 4404]')]

Visualizing the tree with graphviz¶

In [55]:
# Install Graphviz in Python: pip install graphviz
import graphviz 
import pydotplus
%matplotlib inline
In [56]:
dot_data = tree.export_graphviz(clf_gini, out_file=None, max_depth=None,
                                feature_names=X_train.columns,  
                                class_names=True,  
                                filled=True, rotate=True, rounded=True,  
                                special_characters=True)

graph = graphviz.Source(dot_data) 
graph 
Out[56]:
[Rendered graphviz tree: root split on duration ≤ 362.5; internal splits on euribor3m, nr.employed, month, and duration; leaves annotated with gini, samples, value, and class.]
  1. We instantiate the model with the entropy criterion
In [57]:
from sklearn.tree import DecisionTreeClassifier
clf_en = DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)
clf_en.fit(X_train, y_train)
Out[57]:
DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

Predict on the test set with the entropy criterion

In [58]:
# Make predictions
y_pred_en = clf_en.predict(X_test)
In [59]:
print('Accuracy on the training set: {:.2f}'
     .format(clf_en.score(X_train, y_train)))
print('Accuracy on the test set: {:.2f}'
     .format(clf_en.score(X_test, y_test)))
Accuracy on the training set: 0.85
Accuracy on the test set: 0.85

Performance metrics¶

Accuracy score with the entropy criterion

In [60]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, jaccard_score
print('F-1 Score : ',(f1_score(y_test, y_pred_en, average='micro')))
print('Precision Score : ',(precision_score(y_test, y_pred_en, average='micro')))
print('Recall Score : ',(recall_score(y_test, y_pred_en, average='micro')))
print('Jaccard Score : ',(jaccard_score(y_test, y_pred_en, average='micro')))
F-1 Score :  0.8468915565689761
Precision Score :  0.846891556568976
Recall Score :  0.846891556568976
Jaccard Score :  0.7344422472955682
In [61]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
print (classification_report(y_test, y_pred_en))
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      8875
           1       0.81      0.81      0.81      6036

    accuracy                           0.85     14911
   macro avg       0.84      0.84      0.84     14911
weighted avg       0.85      0.85      0.85     14911

In [62]:
# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred_en)
# confusion_matrix puts actual classes on the rows and predicted classes on the columns
cm_matrix = pd.DataFrame(data=cm, index=['Actual Negative:0', 'Actual Positive:1'], 
                                  columns=['Predicted Negative:0', 'Predicted Positive:1'])

plt.figure(figsize=(9,9))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Reds')
In [63]:
plt.figure(figsize=(12,8))

from sklearn import tree

tree.plot_tree(clf_en.fit(X_train, y_train))
Out[63]:
[Text(0.5, 0.875, 'X[10] <= 362.5\nentropy = 0.971\nsamples = 34791\nvalue = [20859, 13932]'),
 Text(0.25, 0.625, 'X[18] <= 3.168\nentropy = 0.756\nsamples = 22575\nvalue = [17663, 4912]'),
 Text(0.125, 0.375, 'X[19] <= 5087.65\nentropy = 0.993\nsamples = 8226\nvalue = [3702, 4524]'),
 Text(0.0625, 0.125, 'entropy = 0.775\nsamples = 3618\nvalue = [825, 2793]'),
 Text(0.1875, 0.125, 'entropy = 0.955\nsamples = 4608\nvalue = [2877, 1731]'),
 Text(0.375, 0.375, 'X[8] <= 7.5\nentropy = 0.179\nsamples = 14349\nvalue = [13961, 388]'),
 Text(0.3125, 0.125, 'entropy = 0.115\nsamples = 14167\nvalue = [13949, 218]'),
 Text(0.4375, 0.125, 'entropy = 0.351\nsamples = 182\nvalue = [12, 170]'),
 Text(0.75, 0.625, 'X[10] <= 525.5\nentropy = 0.829\nsamples = 12216\nvalue = [3196, 9020]'),
 Text(0.625, 0.375, 'X[18] <= 2.916\nentropy = 0.988\nsamples = 4384\nvalue = [1913, 2471]'),
 Text(0.5625, 0.125, 'entropy = 0.704\nsamples = 2083\nvalue = [398, 1685]'),
 Text(0.6875, 0.125, 'entropy = 0.926\nsamples = 2301\nvalue = [1515, 786]'),
 Text(0.875, 0.375, 'X[18] <= 1.402\nentropy = 0.643\nsamples = 7832\nvalue = [1283, 6549]'),
 Text(0.8125, 0.125, 'entropy = 0.379\nsamples = 2315\nvalue = [170, 2145]'),
 Text(0.9375, 0.125, 'entropy = 0.725\nsamples = 5517\nvalue = [1113, 4404]')]
In [64]:
dot_data = tree.export_graphviz(clf_en, out_file=None, max_depth=None,
                                feature_names=X_train.columns,  
                                class_names=True,  
                                filled=True, rotate=True, rounded=True,  
                                special_characters=True)

graph = graphviz.Source(dot_data) 
graph 
Out[64]:
[Rendered graphviz tree: root split on duration ≤ 362.5; internal splits on euribor3m, nr.employed, month, and duration; leaves annotated with entropy, samples, value, and class.]

The model performs well with both the Gini and entropy criteria; on this test set the two yield identical metrics.¶

Detecting feature importance¶

In [65]:
imp_df = pd.DataFrame({
    "Feature Name": X_train.columns,
    "Importance": clf_gini.feature_importances_
})
fi = imp_df.sort_values(by="Importance", ascending=False)

fi2 = fi.head(5)
plt.figure(figsize=(10,8))
sns.barplot(data=fi2, x='Importance', y='Feature Name')
plt.title('Top 5 Feature Importance Each Attributes (Decision Tree)', fontsize=18)
plt.xlabel ('Importance', fontsize=16)
plt.ylabel ('Feature Name', fontsize=16)
plt.show()
In [66]:
import shap
explainer = shap.TreeExplainer(clf_gini)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)

Random Forest¶

Instantiate the model¶

In [67]:
from sklearn.ensemble import RandomForestClassifier
In [68]:
rfc = RandomForestClassifier(random_state=0)
rfc.fit(X_train, y_train)
Out[68]:
RandomForestClassifier(random_state=0)

Predict on the test set¶

In [69]:
y_pred = rfc.predict(X_test)
In [70]:
# Check the model's accuracy
print('Accuracy on the training set: {:.2f}'
     .format(rfc.score(X_train, y_train)))
print('Accuracy on the test set: {:.2f}'
     .format(rfc.score(X_test, y_test)))
Accuracy on the training set: 1.00
Accuracy on the test set: 0.97
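A training accuracy of 1.00 against 0.97 on test suggests the forest is partially memorizing the training set. Capping tree depth or leaf size usually narrows that gap. A hedged sketch on synthetic data (not the notebook's `X_train`):

```python
# Sketch on synthetic data: an unconstrained forest vs. a depth-limited one.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

full = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
capped = RandomForestClassifier(max_depth=6, min_samples_leaf=20,
                                random_state=0).fit(Xtr, ytr)
for name, model in [("unconstrained", full), ("max_depth=6", capped)]:
    print(f"{name}: train {model.score(Xtr, ytr):.2f}, "
          f"test {model.score(Xte, yte):.2f}")
```

The exact values of `max_depth` and `min_samples_leaf` are illustrative; in practice they would be tuned with cross-validation.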

Random Forest model with n_estimators = 100¶

In [71]:
rfc_100 = RandomForestClassifier(n_estimators=100, random_state=0)

# Fit the model to the training set

rfc_100.fit(X_train, y_train)
Out[71]:
RandomForestClassifier(random_state=0)
In [72]:
# Predict on the test set
y_pred_100 = rfc_100.predict(X_test)
In [73]:
# Check the model's accuracy
print('Model accuracy score with 100 decision-trees : {0:0.4f}'. format(accuracy_score(y_test, y_pred_100)))
Model accuracy score with 100 decision-trees : 0.9673

Finding the important features for the Random Forest model¶

In [74]:
# Create a classifier with n_estimators = 100

clf = RandomForestClassifier(n_estimators=100, random_state=0)


# Fit the model to the training set

clf.fit(X_train, y_train)
Out[74]:
RandomForestClassifier(random_state=0)
In [75]:
# Feature importance scores of the model
feature_scores = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores
Out[75]:
duration          0.423999
euribor3m         0.132732
age               0.065772
nr.employed       0.055495
emp.var.rate      0.043414
job               0.034782
education         0.033938
day_of_week       0.030401
cons.price.idx    0.029918
campaign          0.029391
month             0.029384
cons.conf.idx     0.025160
marital           0.017634
housing           0.013558
contact           0.013079
default           0.011368
loan              0.009976
pdays             0.000000
previous          0.000000
poutcome          0.000000
dtype: float64

The most important feature is duration; the least important by far are poutcome, previous, and pdays (these could be dropped).
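Since those three features contribute exactly zero importance, they can be selected programmatically before retraining. A small sketch, with a toy Series standing in for the `feature_scores` computed above:

```python
import pandas as pd

# Toy stand-in for the feature_scores Series computed in the notebook
feature_scores = pd.Series({"duration": 0.42, "euribor3m": 0.13,
                            "pdays": 0.0, "previous": 0.0, "poutcome": 0.0})

# Features whose importance is exactly zero
zero_imp = feature_scores[feature_scores == 0].index.tolist()
print(zero_imp)  # ['pdays', 'previous', 'poutcome']
# In the notebook one would then use X_train.drop(columns=zero_imp)
# before refitting the classifier.
```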

Plot the feature importances¶

In [76]:
sns.barplot(x=feature_scores, y=feature_scores.index)

# Add axis labels
plt.xlabel('Feature importance score')
plt.ylabel('Feature')

# Add a title
plt.title('Most important features')

plt.show()

A bonus: the XGBoost boosting ensemble algorithm¶

Install xgboost using pip¶

Boosting is a machine learning approach based on the idea of building a highly accurate prediction rule by combining many relatively weak and imprecise rules.¶

Boosting assumes the availability of a base (weak) learning algorithm that, given labeled training examples, produces a base (weak) classifier.¶

The key idea behind boosting is to choose the training sets for the base learner in a way that forces it to infer something new about the data each time it is run.¶
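The reweighting idea described above can be sketched as a minimal AdaBoost-style loop: each round, a weak learner (a depth-1 tree) is fit on reweighted data so that previously misclassified points matter more. Synthetic data, illustrative only; in practice one would use xgboost or sklearn's AdaBoostClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
y_pm = np.where(y == 1, 1, -1)        # labels in {-1, +1}
w = np.full(len(y), 1 / len(y))       # uniform initial weights
F = np.zeros(len(y))                  # running ensemble score

for _ in range(20):
    # Weak learner: a decision stump fit on the reweighted sample
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y_pm, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y_pm].sum()                        # weighted error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))  # learner weight
    w *= np.exp(-alpha * y_pm * pred)                  # upweight mistakes
    w /= w.sum()
    F += alpha * pred

acc = (np.sign(F) == y_pm).mean()
print(f"training accuracy after 20 rounds: {acc:.2f}")
```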

In [77]:
pip install xgboost
Requirement already satisfied: xgboost in c:\users\l03534795\anaconda3\lib\site-packages (1.7.5)
Requirement already satisfied: scipy in c:\users\l03534795\anaconda3\lib\site-packages (from xgboost) (1.9.1)
Requirement already satisfied: numpy in c:\users\l03534795\anaconda3\lib\site-packages (from xgboost) (1.21.5)
Note: you may need to restart the kernel to use updated packages.
In [78]:
from xgboost import XGBClassifier
In [165]:
# n_estimators specifies how many boosting rounds (trees) to run.
# You can experiment with your dataset to find the ideal value; typical
# values range from 100 to 1000, although this depends heavily on the
# learning rate.
# learning_rate controls the weight given to each new tree added to the model.

xgb = XGBClassifier(n_estimators=1000, learning_rate=0.001)
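The tradeoff in the comments (more trees can compensate for a smaller learning rate, and vice versa) can be sketched with scikit-learn's GradientBoostingClassifier as a stand-in for XGBoost, on synthetic data, illustrative only:

```python
# Same n_estimators / learning_rate tradeoff, using sklearn's
# GradientBoostingClassifier as a stand-in for XGBoost (assumption:
# the tradeoff carries over; data is synthetic, not the bank dataset).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

accs = {}
for lr in (1.0, 0.1, 0.001):
    m = GradientBoostingClassifier(n_estimators=100, learning_rate=lr,
                                   random_state=0).fit(Xtr, ytr)
    accs[lr] = m.score(Xte, yte)
    print(f"learning_rate={lr}: test accuracy {accs[lr]:.3f}")
```

With only 100 trees, the very small learning rate (0.001) typically underfits, which mirrors the modest 0.89 accuracy obtained below with `learning_rate=0.001`.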
In [166]:
xgb.fit(X_train, y_train)
Out[166]:
XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=0.001, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=1000, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...)
In [167]:
y_pred = xgb.predict(X_test)
In [168]:
print('Accuracy on the training set: {:.2f}'
     .format(xgb.score(X_train, y_train)))
print('Accuracy on the test set: {:.2f}'
     .format(xgb.score(X_test, y_test)))
Accuracy on the training set: 0.89
Accuracy on the test set: 0.89
In [169]:
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, jaccard_score
print('F-1 Score : ',(f1_score(y_test, y_pred, average='micro')))
print('Precision Score : ',(precision_score(y_test, y_pred, average='micro')))
print('Recall Score : ',(recall_score(y_test, y_pred, average='micro')))
print('Jaccard Score : ',(jaccard_score(y_test, y_pred, average='micro')))
F-1 Score :  0.8898799543960835
Precision Score :  0.8898799543960835
Recall Score :  0.8898799543960835
Jaccard Score :  0.8016069594635413
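The first three scores coincide by construction: in single-label classification, micro-averaging pools the true-positive, false-positive, and false-negative counts over all classes, so micro precision, micro recall, and micro F1 all reduce to plain accuracy. A quick check on toy labels:

```python
# In single-label classification, micro-averaged precision, recall, and
# F1 all equal accuracy (toy labels, chosen so 3/4 = 0.75 is exact).
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1]
y_hat  = [0, 1, 1, 1]
scores = {accuracy_score(y_true, y_hat),
          f1_score(y_true, y_hat, average='micro'),
          precision_score(y_true, y_hat, average='micro'),
          recall_score(y_true, y_hat, average='micro')}
print(scores)  # all four collapse to the single value 0.75
```

This is why the macro and weighted averages in the classification report below are more informative than the micro averages for an imbalanced target.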
In [170]:
from sklearn.metrics import classification_report, confusion_matrix, roc_curve
print (classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.95      0.86      0.90      8875
           1       0.82      0.94      0.87      6036

    accuracy                           0.89     14911
   macro avg       0.89      0.90      0.89     14911
weighted avg       0.90      0.89      0.89     14911

In [171]:
# Confusion matrix: rows are actual classes, columns are predicted classes
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
cm_matrix = pd.DataFrame(data=cm,
                         index=['Actual Negative:0', 'Actual Positive:1'],
                         columns=['Predicted Negative:0', 'Predicted Positive:1'])

plt.figure(figsize=(9,9))
sns.heatmap(cm_matrix, annot=True, fmt='d', cmap='Reds')
plt.show()
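In scikit-learn's convention the confusion matrix has actual classes on the rows and predicted classes on the columns, so class-1 precision and recall can be read straight off it. Toy numbers, not the notebook's results:

```python
import numpy as np

# Toy 2x2 confusion matrix in sklearn's layout: [[TN, FP], [FN, TP]]
cm = np.array([[50, 10],    # actual 0: TN=50, FP=10
               [ 5, 35]])   # actual 1: FN=5,  TP=35
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)   # 35/45
recall    = tp / (tp + fn)   # 35/40
print(f"precision={precision:.3f} recall={recall:.3f}")
```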
In [172]:
# Feature importance scores of the model
feature_scores = pd.Series(xgb.feature_importances_, index=X_train.columns).sort_values(ascending=False)

feature_scores
Out[172]:
duration          0.225223
euribor3m         0.217821
emp.var.rate      0.148687
nr.employed       0.147049
month             0.127078
default           0.021087
day_of_week       0.015067
cons.conf.idx     0.014264
contact           0.013764
education         0.012528
cons.price.idx    0.011885
campaign          0.011836
age               0.007686
housing           0.007615
loan              0.007081
marital           0.006598
job               0.004735
pdays             0.000000
previous          0.000000
poutcome          0.000000
dtype: float32

END¶